NSF PAR Search | NSF Public Access Repository


An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores

Valpey, Benjamin; Li, Xinyi; Pai, Sreepathi; Gopalakrishnan, Ganesh (June 2025, NASA Formal Methods)
Titolo, Laura (Ed.)
Many recent computational accelerators provide non-standard (e.g., reduced precision) arithmetic operations to enhance performance for floating-point matrix multiplication. Unfortunately, the properties of these accelerators are not widely understood and lack sufficient descriptions of their behavior. This makes it difficult for tool builders beyond the original vendor to target or simulate the hardware correctly, or for algorithm designers to be confident in their code. To address these gaps, prior studies have probed the behavior of these units with manually crafted tests. Such tests are cumbersome to design, and adapting them as the accelerators evolve requires repeated manual effort. We present a formal model for the tensor cores of NVIDIA’s Volta, Turing, and Ampere GPUs. We identify specific properties—rounding mode, precision, and accumulation order—that drive these cores’ behavior. We formalize these properties and then use the formalization to automatically generate discriminating inputs that illustrate differences among machines. Our results confirm many of the findings of previous tensor core studies, but also identify subtle disagreements. In particular, NVIDIA’s machines do not, as previously reported, use round-to-zero for accumulation, and their 5-term accumulator requires 3 extra carry-out bits for full accuracy. Using our formal model, we analyze two existing algorithms that use half-precision tensor cores to accelerate single-precision multiplication with error correction. Our analysis reveals that the newer algorithm, designed to be more accurate than the first, is actually less accurate for certain inputs.
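The interaction between rounding mode, precision, and accumulation order can be illustrated with a toy model. The sketch below is not the paper's SMT formalization and does not model any real tensor core. It assumes a hypothetical accumulator that rounds to a fixed number of significand bits after every addition; the names `round_to_precision` and `accumulate` are invented for illustration.

```python
import math


def round_to_precision(x: float, bits: int, mode: str) -> float:
    """Round x to `bits` significand bits.

    mode is "nearest" (round-half-to-even) or "zero" (truncate).
    Hypothetical helper for illustration only.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)            # x == m * 2**e with 0.5 <= |m| < 1
    scaled = m * (1 << bits)        # exact: scaling by a power of two
    q = round(scaled) if mode == "nearest" else math.trunc(scaled)
    return math.ldexp(q, e - bits)


def accumulate(terms, bits: int, mode: str) -> float:
    """Sum terms left to right, rounding after every addition."""
    acc = 0.0
    for t in terms:
        acc = round_to_precision(acc + t, bits, mode)
    return acc


# Same terms, different order: the small terms are absorbed when added last.
print(accumulate([2048.0, 1.0, 1.0], 11, "nearest"))  # 2048.0
print(accumulate([1.0, 1.0, 2048.0], 11, "nearest"))  # 2050.0

# Same terms and order, different rounding mode: a discriminating input.
print(accumulate([2048.0, 3.0], 11, "nearest"))       # 2052.0
print(accumulate([2048.0, 3.0], 11, "zero"))          # 2050.0
```

An input such as `[2048.0, 3.0]` plays the role of a discriminating input in this toy setting: two candidate models of the accumulator produce different outputs on it, so one observation of the actual result rules out one of them.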
An SMT Formalization of Mixed-Precision Matrix Multiplication: Modeling Three Generations of Tensor Cores

https://doi.org/10.1007/978-3-031-93706-4_21

Valpey, Benjamin; Li, Xinyi; Pai, Sreepathi; Gopalakrishnan, Ganesh (January 2025, Springer Nature Switzerland)

A Fast, General System for Buffered Persistent Data Structures

https://doi.org/10.1145/3472456.3472458

Wen, Haosen; Cai, Wentao; Du, Mingzhe; Jenkins, Louis; Valpey, Benjamin; Scott, Michael L. (August 2021, 50th Intl. Conf. on Parallel Processing (ICPP))
The emergence of fast, dense, nonvolatile main memory suggests that certain long-lived data might remain in their natural pointer-rich format across program runs and hardware reboots. Operations on such data must currently be instrumented with explicit writeback and fence instructions to ensure consistency in the wake of a crash. Techniques to minimize the cost of this instrumentation are an active topic of research. We present what we believe to be the first general-purpose approach to building buffered persistent data structures, and a system, Montage, to support that approach. Montage is built on top of the Ralloc nonblocking persistent allocator. It employs a millisecond-granularity epoch clock, and ensures that no operation appears to span an epoch boundary. It also arranges to persist only that data minimally required to reconstruct the structure after a crash. If a crash occurs in epoch e, all work performed in epochs e and e − 1 is lost, but work from prior epochs is preserved, consistently. As in traditional file and database systems, a sync operation can be used to flush buffers on demand; the Montage sync is extremely fast. We describe the implementation of Montage, argue its correctness, and report unprecedented throughput for persistent queues, sets/mappings, and general graphs.
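The epoch rule stated in the abstract (a crash in epoch e loses epochs e and e − 1, while earlier epochs survive) can be sketched with a small simulation. This is a toy model, not Montage's implementation or API: the class `EpochStore`, its methods, and the choice to implement `sync` as two clock ticks are all assumptions made for illustration.

```python
class EpochStore:
    """Toy model of epoch-buffered persistence (hypothetical, not Montage)."""

    def __init__(self):
        self.epoch = 2        # current epoch; the starting value is arbitrary
        self.buffered = {}    # epoch -> list of (key, value) not yet durable
        self.durable = {}     # stands in for the persistent state

    def put(self, key, value):
        # Every write is tagged with the epoch in which it happened.
        self.buffered.setdefault(self.epoch, []).append((key, value))

    def advance(self):
        """Tick the clock from e to e + 1, first persisting epoch e - 1."""
        for key, value in self.buffered.pop(self.epoch - 1, []):
            self.durable[key] = value
        self.epoch += 1

    def sync(self):
        # Assumption: two ticks make all writes issued so far durable.
        self.advance()
        self.advance()

    def crash(self):
        """Return what recovery would see: durable data only."""
        return dict(self.durable)


store = EpochStore()
store.put("a", 1)     # written in epoch 2
store.advance()       # now epoch 3
store.put("b", 2)     # written in epoch 3
store.advance()       # now epoch 4; epoch 2 has been persisted
store.put("c", 3)     # written in epoch 4
print(store.crash())  # {'a': 1}: epochs 3 and 4 are lost
store.sync()
print(store.crash())  # {'a': 1, 'b': 2, 'c': 3}
```

Persisting epoch e − 1 only when the clock leaves epoch e keeps the recovered state consistent in this model: everything recovery sees comes from epochs that had already ended before the crashed epoch began.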
Building Fast Recoverable Persistent Data Structures with Montage

https://doi.org/10.4230/LIPIcs.DISC.2020.52

Wen, Haosen; Cai, Wentao; Du, Mingzhe; Valpey, Benjamin; Scott, Michael L. (October 2020, Leibniz international proceedings in informatics)
The recent emergence of fast, dense, nonvolatile main memory suggests that certain long-lived data structures might remain in their natural, pointer-rich format across program runs and hardware reboots. Operations on such structures must be instrumented with explicit write-back and fence instructions to ensure consistency in the wake of a crash. Techniques to minimize the cost of this instrumentation are an active topic of current research. We present what we believe to be the first general-purpose approach to building buffered durably linearizable persistent data structures, and a system, Montage, to support that approach. Montage is built on top of the Ralloc nonblocking persistent allocator. It employs a slow-ticking epoch clock, and ensures that no operation appears to span an epoch boundary. If a crash occurs in epoch e, all work performed in epochs e and e-1 is lost, but all work from prior epochs is preserved. We describe the implementation of Montage, argue its correctness, and report on experiments confirming excellent performance for operations on queues, sets/mappings, and general graphs.